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ABSTRACT 

This paper outlines a unified framework for high dimensional variable selection for classifi- 
cation problems. Traditional approaches to finding interesting variables mostly utilize only 
partial information through moments (like mean difference) . On the contrary, in this paper we 
address the question of variable selection in full generality from a distributional point of view. 
If a variable is not important for classification, then it will have similar distributional aspect 
under different classes. This simple and straightforward observation motivates us to quantify 
'How and Why' the distribution of a variable changes over classes through CR-statistic. The 
second contribution of our paper is to develop and investigate the FDR based thresholding 
technology from a completely new point of view for adaptive thresholding, which leads to a 
elegant algorithm called CDfdr. This paper attempts to show how all of these problems of 
detection, extraction and interpretation for interesting variables can be treated in a unified 
way under one broad general theme - comparison analysis. It is proposed that a key to ac- 
complishing this unification is to think in terms of the quantile function and the comparison 
density. We illustrate and demonstrate the power of our methodology using three real data 
sets. 



Keywords and phrases: Comparison density, Mid-distribution function. Variable selection, 
Ranking, Orthogonal Series Density Estimation, Score function, Wilcoxon Statistics, FDR, 
Pre-fiattened smoothing . 



1 Introduction 

Consider a classification problem where we have a large number of predictor variables Xi, X2, . . . , Xp 
and where Y denotes the class labeling. Our goal is to find the most 'influential' variables, and 
more importantly, to gain a deeper insight on the questions : why and when as well as how a 
variable could be influential, through visualization. This extra piece of information can help 
applied researchers not only to identify the needle in a haystack but to classify them according 
to their impact types and thus, provide specific hypotheses for further investigation. 
To capture the interesting variables, first we need to understand how a variable could be 
informative about the classes. For this we consider a few illustrative examples and attempt 
to define exactly what is meant by 'interesting/uninteresting'. Figure 1 gives an idea about 
discriminatory information {how and what) a variable could render and set the stage for our 
method. We now highlight a few key aspect of Figure 1 with regard to variable selection 
methodology 

(a.) Information: Distribution vs. Moments. Figure 1 in panel A, demonstrates the difference 
in location and scale of variable # 14 (Bilirubin) of the Hepatitis data (Diaconis and Efron, 
1983) for the two classes, 'dead' and 'alive'. Note that the variability of Bilirubin is higher 
in the "red" class ('dead') compared to the blue one. Similarly, Panel C on the distribution 
of Gene ^ 610 of the Prostate Cancer data (cf. Section 5) for the two classes ('sick' and 
'health') shows contrasting skewness and considerable departure from normality and Panel D 
on Gene 7^ 332 of the same data set clarifies the existence of contrasting tail behavior and 
the presence of bi-modality, which gives a strong message about the presence of the disease 
; This shows that each variable may exhibit a different kind of information and we want our 
methodology to be flexible enough to capture these effects. So, this automatically raises the 
issue as to how we can possibly identify the variables with different types of information and 
how we should measure those in a coherent manner. Thus we need a methodology that, lets 
the data determine the variables of importance to the greatest extend possible. 
This paper aims to exploit the distributional information of a variable over different classes by 
developing a set of score statistics to quantify that. This is in marked contrast to the current 
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Figure 1: Important Variables are Different for Different reasons : Panels A,B : give is the 
plots of the distributions of Variable # 14 (continuous) and # 1 (categorical) for hepatitis data 
set over two classes Y = 1 ('Dead') and Y = ('Alive'). Panels C and D: show distribution 
of Genes # 610 and # 332 of the Prostate Cancer data, over the two classes Y = 1 ('Cancer') 
and y = ('Healthy'). It shows that the distinguishable discriminatory pattern is different 
for different variables . 
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practice of identifying 'interesting' of variable solely on the basis of a single moment based 
information ( e.g., mean level change). 

Remark 1 (Biological Significance). Not until very recently, biologist have noticed the 
importance of detecting the higher order information in genes. Feinberg and Irizarry (2010) 
and Hansen et al. (2011), have argued and confirmed the importance of detecting genes which 
are differentially expressed in terms of ^^variability" as opposed to the traditional approach of 
discriminating on the basis of the ^^mean" change for colon tumor. This is further confirmed 
by Figure 1, where we can easily see that diseased class exhibit higher variability than the 
normal class for all of the variables. Although most traditional variable selection approaches 
use only partial discriminative information through the first order (i.e., through the mean 
level) to filter the important variables, our method is fiexible enough to detect those genes 
which show increased variability and other specific distributional aspects (e.g., higher order 
moment information) in cancer tissues compared to healthy normal samples and thus, have 
the potential to find a new set of disease-related genes which was not previously anticipated. 

(b.) Nonparametric and Robust. Our approach is fully nonparametric and robust. Further it 
can accommodate missing values (e.g., the Hepatitis data set that we refer to Fig. 1 contains 
lots of missing observations). Another useful quality of our methodology is that it is invariant 
under monotone transformations, as a result of it works perfectly fine without any type of 
normalization techniques (specially for microarray data). The extended long tail of Gene 7^ 
332 in panel C and the omnipresence of noise in the microarray data necessitate robustness 
against departure from normality, which is guaranteed by our method. 

(c.) Unification and Integration. Our methodology provides a mechanism for handling contin- 
uous, discrete and categorical variables (Hepatitis data contains all of them) through an inge- 
nious mid- distribution transformation. This enables us to propose a unified method for finding 
differentially expressed genes under different microarray platforms (DNA microarray technol- 
ogy/Counting based technology like Massively parallel signature sequencing (MPSS)/Next 
generation sequencing technology (RNASeq data)). 

(d.) Characterizing the Information content. Each interesting variable has different informa- 
tion content for different reasons and that is why we seek a method that would shed light 
on how and why a particular variable is important. One of the most important features of 
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our method is its ability to offer additional interpretability and understanding in scientifi- 
cally meaningful terms through visualization. This non-classical property of variable selection 
would help the scientist to get a fuller picture of what is happening. 
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Figure 2: Importance ordering : Each panel plots the value of CR-statistics after sorting. 
Panel A deals with the Hepatitis study where we can see a natural gap, marked using a gray 
dotted line. Panel B: Prostate cancer study. We have only plotted top 100 score statistics for 
clarity. Our CDfdr algorithm presented in the Section 6.2 detects such natural gaps or thus 
selects proper thresholds 



In Sections 2-4 we discuss several key concepts and present a novel algorithm to compute 
and interpret such variable importance measures once we quantify the importance of each 
of the variables, we can rank the variables in a decreasing order as in Figure 2. Section 
5-6, the main part of the paper, describes a simple and straight forward approach for FDR 
corrected threshold selection based on comparison density. It turns out that the threshold 
selection problem is intimately related to the problem of estimating the tail of the ratio of 
two density. In this respect, we have developed an elegant fully nonparametric approach to 
FDR based adaptive thresholding, which we shall refer to as the CDfdr Algorithm. The key 
to accomplishing this also turns out to be expressing the FDR in terms of the comparison 
density, a fundamental concept introduced by Parzen (1979). 



5 



The main observation behind the construction of the proposed CDfdr based variable selection 
procedure is that directly estimating the ratio of two density is more efficient and stable, com- 
pared to estimating them separately and then taking the ratio. We also employ an approach 
based on Pre-flattened smoothing of comparison density which can efficiently capture the be- 
havior of the tail of the ratio of two densities. This connection among the comparison density, 
the false discovery rate and the Pre-flattening approach is new. Our framework for threshold 
selection provides a new perspective that could be considered as a general methodology for 
adaptive threshold selection for recovering signal from its noise corrupted version. 

In summary, we provide a fresh look at the problem of variable selection and address the issue 
using various key concepts from modern quantile based nonparametric methods. In the first 
part (Section 2-4) of the paper, we describe the background concepts and attempt to build 
the detector based on CR-statistic which can potentially quantify the Hnformativeness' of a 
variable. The second part (Section 5-6) is devoted to building the CDfdr algorithm to separate 
the ^significant^ (signal) variables from the noise. We have made a special effort to illustrate 
our concepts through various examples. This is a very difficult task as we are to work with 
thousands of variables where each variable is special in its own right. In this article, we have 
limited ourselves to the independent case while remaining central to the topic at hand. We do 
this primarily to introduce the remarkably powerful foundational ideas, which have enormous 
potential to tackle high dimensional inference from a completely nonparametric and robust 
point of view. 

2 Background Concepts 
2.1 Mid-Distribution Transform 

Definition and Properties. 

Parzen (1989) introduced the seminal concept of a mid-distribution function, which is calcu- 
lated as a transformation of ranks of the tied data. Let X be a random variable with distri- 
bution function F{x) = Pr(X < x) and probability mass function (pmf) p{x) = Pr(X = x). 
The Mid-distribution function is defined as. 




id 



(2.1) 
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Table 1: Mid- Rank transformation of Protime variable (First 10 samples) 



X (Raw Data) 


80 


75 85 


54 


52 


78 


46 


63 85 


62 


u (Mid-Rank) 


0.778 


0.732 0.818 


0.392 


0.357 


0.767 


0.261 


0.534 0.818 


0.511 ... 



When the sample consists of distinct values, (xj) = {Rj — .5)/n, where Rj is the rank 

of Xj in the sample of size n. For the tied cases we use the average rank instead of Rj. Note 
that for continuous random variable X, we have F™*^(X) = F(X) ~ U{0, 1). 
A Small Example. 

Table 1 shows the raw data along with the corresponding mid-rank transformed version for 
first 10 sample values of variable # 18 (Hepatitis data). 

Remark 2 {Why Fmid-Transf ormation). (i) Unification: The notion of mid-rank transforma- 
tion enable us to deal simultaneously with discrete and continuous distribution, (ii) Robust- 
ness : The rank based nature of the Fmid transformation makes it robust, (iii) Invariance 
Property & Normalization: The standard practice for normalizing microarray data is to take 
log-transformation (as a variance-stabilizing transformation) for normalization. But although 
log-transformation stabilizes the variance of high expression labels, it is known that this trans- 
formation increases the variance of the observations near the background. Our Fmid based 
nonlinear transformation is invariant under monotone transformation and we do not need 
to supply the calibrated (biased) microarray expression. Our method can directly work on 
the raw data and produce the same result. From another angle, Fmid based transformation 
could be considered as a universal normalization technique, regardless of the measurement 
error structure. 

How to Compute. 

We will start all our nonparametric statistical data analysis by taking mid-rank based non- 
linear transformation of the raw data x . Transforming the raw data x = (xi, X2, ■ ■ ■ , Xn) into 
the Fmid-domain is simple and straightforward using the rank command in R. 

[rank(x, ties .method = cC'average")) — .5] 
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From now on we will work with the U matrix rather than the X data matrix. 



[Original Raw Data] — )■ [Mid-Rank Transform Data] 



X U 



(2.3) 



2.2 Two Sample Data Analysis & Comparison Density 

Notation. 

To motivate the concept of "Comparison Density" , we consider the two population classifica- 
tion problem. We assume that (Xj, yj), i = 1, 2, . . . , n are independent observations where Y 
is a binary — 1 variable corresponding to the two populations from which we are observing 
the p dimensional random vector X, with corresponding sample sizes no and rii (say). 
We use the notation F{x\Y = y) = Pr(X < x\Y = y) to denote the conditional distribution 
of a single variable X, a generic component of the vector X (e.g., X = Bilirubin level in Panel 
A of Figure 1 ) given Y = y, where y G {0, 1}. For notational simplicity, from now on, set 
F{x\Y = 1) = F{x) (distribution of "red" class) and F{x\Y = 0) = G{x),x G M. Let H{x) 
denote the unconditional cumulative distribution (cdf) function which is given by 



where vr = Pr(F = 1). For simplicity of exposition, suppose that F(-) and G{-) are absolutely 
continuous with quantile functions F~^[-) and G^^(-), respectively. 
Comparison Distribution Function. 

Under the classification setup for variable selection, our main objective is to compare the 
two distributions F and G and to check whether they differ significantly. Based on that, 
we declare whether the variable X is informative or not. To compare two probabilities pi 
and p2 G (0,1) we prefer to use pi/p2 rather than pi — p2. Applying this philosophy for 
the purpose of comparing two distribution functions F{x) and G{x), we propose the ratio 
comparison principle and define concepts of pooled Comparison Distribution function (Parzen, 



H{x) = nF{x) + (1 - n)G{x) 



1983, 1992) as 



D{u; H, F) = F{H-\u)), 0<u<l. 



(2.4) 



Its density, called the Comparison Density function, satisfies 



d{u; H, F) := D\u- H, F) = \' ' '\ 

7Tf{H ^{u)) + {l-TT)g{H i(m)) 




< M < 1. 



(2.5) 
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Comparison distributions and density concepts can be used to compare two discrete distri- 
butions ( with pmfs pf and pn) of a variable (e.g., for the variable age in hepatitis data) by 

setting 

1 ^ 

d{u- H, F) = D{u- H,F)= [ d{s, H, F) ds, 0<u< 1. (2.6) 

PH[rl \u)) J 



PP-Plots. 

Suppose that H is discrete and Xi < X2 < ■ ■ ■ < Xr are the probability mass points of H. Then 
one can easily verify that D{u;H,F) is a piecewise linear between its values at Uj — H{xj) 
and that 

D{uf H, F) = F{H-\u,)) = F{x^), for j = 1, 2, . . . , r. 

The graph of D{u;H,F) is called the PP-Plot. It joins the points {H{uj), F{uj)), j — 
1, 2, . . . , r by linear interpolation. 

Why is PP-Plot Important for Variable Selection ? 

First note that identifying an important variable X and testing Hq : F{x) = H[x) are equiv- 
alent problems. If F H, it indicates that we cannot hope to extract any meaningful 
information from this particular variable. Now, to see that it is also related to the comparison 
distribution make the change of variable u — H{x) or x — H~^{u), to express the hypothesis 
to be tested as 

Ho : F{H-\u)) = u, or Hq : D{uj; H, F) = u. 

Figure 3. shows how this hypothesis can be visually tested using the PP-Plot. Looking at the 
PP-plot of Panel B ( Figure 3), we can conclude that gene # 610 of the prostate cancer data 
is probably an uninteresting variable. We can therefore measure importance of a variable in 
terms of the norm of D{u) — u or d{u) — 1 or by other functional of the comparison density. 
To make this concept more precise, we introduce a class of specially designed score functions 
and an orthogonal series model for d{u). 

2.3 Estimating the Comparison Density 

In this section we construct an estimator of comparison density using orthogonal mid- distribution 
score functions as basis functions, which will be used heavily to build measures of importance 
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and thus for identifying interesting variables. In the context of variable selection, the sole pur- 
pose of the comparison density is to indicate the nature of discriminatory information hidden 
in a variable. The various shapes of the comparison density gives answers to the questions 
why & how a variable is important and indicate direction for follow-up scientific investigation. 
For example, a quadratic pattern (Panel A of Figure 4(b)) of the comparison density would 
suggest that the variable has high second order information, which means it shows highly 
significant difference in variability between two classes. It is the job of the biologist to explain 
this statistically significant discriminative pattern (see Remark 1). 
Orthogonal Series Density Estimation of the Comparison Density 
We consider the following expansion of comparison density 



d{u):=J2^kSk{u),0<u<l, (2.7) 

k=0 

where {S'fc('u)}^^ forms a complete orthonormal basis for L^[0, 1]. So that 

1 

ek = J Skiu)diu) du = E{SkiU)), 



since d{u) du = 1. In practice, the comparison density can be approximated by a finite 
series 

M 

d{u-9) = Y,dkSk{u),{)<u<l, (2.8) 

A;=0 

for a suitable choice of M, called truncation point. 

Motivation: How to design score functions 

A representation of the score coefficient is given by. 



9k = J d{u) Sk{u) du = JSk{u)f{H-\u))/h{H-\u))du 



oo 

= J Sk{H{x))f{x) dx = E[Sk{H{X))\Y = l], (2.9) 

— oo 

The two sample Wilcoxon rank-sum test 1/ni Yl^Li Rj is expressed in terms of the mid-rank 
transformation {H^^^(-)) or (Rj — .5)/n, with mean .5 and variance given by (see Parzen 
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(d) (e) 

Figure 3: PP-Plot and Comparison Density : Plot (a) and (b) denotes the two class distri- 
bution of Gene #610 and # 4OI8 of the Prostate cancer data set. Corresponding PP-Plots 
are shown in (b),(d) which is the plot of D{u; H, F) = D{u] H, F) . Plot (c) and (e) the the 
corresponding smooth comparison density estimate d{u). The flat comparison density of plot 
(e) indicates that the distribution of Cene # 4OI8 barely vary over the two classes which leads 
to the conclusion that Cene # 4OI8 contains no information for class and can be considered 
as noise. Whereas Cene # 610 shows a considerable departure from uniformity and its shape 
indicates that it carries second and higher order information. 
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where x'^ denotes the ith distinct value of X. We use the following representation of Wilcoxon 
statistics in terms of mid-rank transformation, define 

rii 

Wil = ni5];[(i?,-.5)/n] = E[if"^i^(X)|r = ll (2.11) 

Eq. (2.10) and (2.11) motivate us to define score function Si{u) = cr^i^iu — .5), where u = 
H^^'^{x), to represent a version of the Wilcoxon statistic (linearly equivalent version or nor- 
malized, which is asymptotically normal as a score statistic, given by Si{u) d{u) du, where 
d{u) = d{u; H,F), a raw estimator of d{u). This new interpretation of Wilcoxon statistics in 
terms of a (linear) functional of the comparison density with appropriately chosen score func- 
tion, opens up new possibilities to design effective score functions to build powerful variable 
selection detectors, which we will explore in details in the following sections. 
Construction of score functions. 

Novelty of our approach is in the construction of the basis functions. In contrast to the 
standard practice of taking the basis as powers of x, here we construct orthonormal score 
functions based on ranks through mid-distribution transform. Define 



^mid 



» - 1/2) 



Si{x) = ^ ^, xgM (2.12) 

and then sequentially construct Sk{x),k = 1, 2, ... M by Gram-Schmidt ortho-normalization 
of S^{x),k = 1,2,...M. 

3 Interpretation and Insight 

The usefulness of d{u] H, F) comes from the following fact 

d{u; H,F) = l iff F{x) = G{x), for all x. (3.1) 

Figure 3e illustrate this behavior. The flat comparison density indicate that the variable is 
pure noise as a detector. But the more interesting phenomena is the case where the variable 
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Comparison Density tor Gene# 332 




Figure 4: Shapes of Score functions and the induced Comparison density : (a) describes the 
shapes of the first four score functions Si{u), . . . , 6*4 (-u). (b) L2 comparison density estimate 
using the four score functions. The shape of the d{u) for Variable # 18 indicates that it has 
a information in the tail behavior (or the fourth order moment). 



13 



is influential. In that case d{u) takes varieties of interesting shapes. The following theorem 
gives a justification of our score statistics which will be defined soon, 

Theorem 1 {Goodness of Fit Interpretation). The variable importance can be quantified by, 



i 

/oo 
k=l 



(3.2) 



Proof. Proof directly follows from Parseval's identity. □ 

Here O^s are the L2 parameters of the density. It states that J2h=i ^fc' "with a proper choice 
of M, acts as a measure of how uniform the comparison density is, which in turn assigns an 
importance measure to a particular variable. The clever construction of our basis function 
opens another important interpretation in terms of nonparametric rank correlation. §1 can be 
neatly rewritten as the following which add a nonparametric flavor to it. 

Theorem 2 {Rank Correlation Interpretation). Let Corr[r, Si{ if™^(X))] = R[Y, Si{H^^'^{X))] . 



Then we have the following correlation interpretation 

§1 = V(l-7r)/7r R \y, 5i( if™^(X)) 1 = Wil(X) 



(3.3) 



Proof. Note that, 
R 



Y,Si{H'^''^{X)) 



E 



Si{H'^''^{X))Y 



(l-vr) 



TT 



TT 



1 - TT 



E 



5i(if"^i^(X))|r = 1 



Comparing this with Eq. (2.9) completes the proof. 



(3.4) 

□ 



Implications of this result are : (i) The Wilcoxon statistic can be interpreted as a correlation 
between Y and Si{X). In this sense, it is a linear detector, (ii) The estimate of 61 has 
a nonparametric rank correlation interpretation by virtue of the special choice of the basis 
functions that we have introduced in the previous section. 

The representation (3.3) motivates us to propose Criterion Correlation(CR) statistics, which 
is an important generalization, yielding a general nonlinear rank based detector: 

M 



CR = J2\r{y, Sa{H'^''^{X)) 

a=l 



(3.5) 
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3.1 Component Correlation Interpretation 

The measure that we have introduced in Eq 3.5 have a third interpretation from component 
correlation perspective. To define what we define by components, let's first choose score 
functions Si{u), S2{u), . . . Sm{u) satisfying 

1 1 
j S{u) d{u) = 0, and j S\u) d{u) = 1. 



Now we form the components or linear detectors j = 1, . . . , M by 

1 

f{Sj) =E{S^{u)) = J Sj{u) d{D{u) - u) =< Sj,d>, (3.6) 



which is often shown to follow asymptotically normal, under Hq. It is easy to verify using 
Eq (2.11) and Theorem 3.3 that the Wilcoxon statistic (linearly equivalent) has a following 
equivalent representation in terms of functional of the comparison density empirical process, 

1 

<Si{u),d> = j Si{u)d{b{u)-u), (3.7) 



where Si{u) = cr^-^{u — .5). Under this new representation, Wilcoxon rank-sum test can be 
viewed as just the first component of our proposed test statistic (Eq 3.5). Further more, many 
important classical test statistics to test the equality of distributions {Hq : F{x) = if(x),Vx) 
can be represented as weighted sums of squares of suitable components (which is also known 
as a quadratic detector): 

oo 

^{wj < Sj,d>}^ . 
i=i 

For example Wj = l/(j7r) and Sj{u) = \/2cos{j'Ku) gives the famous Cramer- Von Mises 
statistic. One reason to prefer this form of expressing nonparametric statistics is that they 
help to identify sources of significance. The components can tell us how the behavior of 
a specific variable is different in two different classes. The key concepts is the comparison 
distribution and score function which enable us to unify and choose between diverse statistics 
available for variable selection. Apart from integrating the different concepts, we are now 
in a position to deliver valuable insight about the operation of significant variables which 

15 



could have significant impact on scientific understanding. In contrast conventional off-the- 
shelf variable selection machinery like t-statistics, Lasso, Wilcoxon statistics, simple Pearson 
correlation measure, etc. have a limited practical utility in the line of our intended application. 
In the next section, we will discuss few properties of the key stochastic process, comparison 
distribution empirical process (introduced in Parzen (1983)) to derive asymptotic distribution 
of our proposed variable selection statistic. 

4 Comparison Distribution Empirical Process: Weak Convergence 

and Limit Theory 

Rigorous starting point for our main result is provided by the following fundamental theorem 
for two sample comparison density empirical process (Pyke and Shorack, 1968, Parzen, 1999). 

Theorem 3 {Comparison Density Empirical Process). Assume that F and G respectively 
have positive continuous derivatives f and g respectively and that also \imn^oo{fii/n) = vr G 
(0, oo). Suppose further that d{u; H, F) and d{u; H, G) are bounded on any (a, b) C (0, 1). We 
denote by Mp and Mq two independent uniform Brownian Bridges, given by Mpiu; H, F) = 
Mf{F{H~\u))) andMG{u;H,G) =Mg{G{H-\u))). Then, as n oo. 



^{d{u-H,F)- D{u-H,F)^ 



(1-71)1 -^diu; H, G)B^(m; H, F) diu; H, F)Mg(u; H,G)\ ,0 <u<l. 

Note that under Hq : F = G = H , y/n {d{u] H, F) - D{u; H, F) j turns out to be y/n{D{u) - 
u), where D{u) = D{u] H, F). Our aim here is to derive the asymptotic distribution of the 
J{u) d(D{u) — u), for suitable linear functional of comparison density empirical process. 

Theorem 4 (Asymptotic Distribution). Define A(J) = y/n | J{u) dD{u) — J{u) dD(M)| 

Then under Hq, A(J) is asymptotically normal with covariance kernel Kjy{Ji, J2), is given by 

1 1 
Ji{u)J2{u) du — J Ji{u) du j J2{u) du 
00 

For proof see Parzen (1983). This results readily gives us the following useful corollary for the 
limit distribution of our rank based statistics. 
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Corollary 1 {Null Density). nP — )• Xm- 

Proof. Our designed score functions are orthonormal, i.e., in our case Ji{u)J2{u) du = 



and Jf{u) du = l. This implies that R Y, 5i( 



rmid/ 



are asymptotically iid and 



A/'(0,1). (4.1) 
Result follows from the definition of the CR-statistics (Eq. 3.5 ). □ 
5 Algorithm and Illustration 

Before going any further, at this point we will take a pause and revisit the Prostate data 
example to see how to apply the previous theory to come up with an importance score using 
the CR-statistic for each of the 6033 variables. 

(i) . Transformation. The first step is to transform raw data matrix X to U, mid-rank trans- 

formed values ( see Eq 2.3). 

(ii) . Basis Expansion. Construct score functions Sk{ H^^'^{Xk) ), A; = 1, . . . , 4 for each vari- 

able. Panel A of Figure 4 shows the generic shapes of first four score functions for the 
prostate cancer data. This score functions can be thought of as a kernel for projecting 
the raw data from p dimension to 4p dimension. 

(iii) . Computing CR-statistic. Once we compute the sufficient statistics (score functions) for 

each features, we use Eq. 3.5 and generate the CR-statistics for each variable. 

(iv) . Ranking & Categorizing. We can rank the variables according to the values of the CR- 

statistic (see Fig 2) and select a 'proper threshold' to identify few infiuential variables 
from the list of 6033 variables. More importantly we can categorize the important 
variables according to their 'discriminatory role' played by the variables. Table 2 not 
only finds the top interesting variables but also label the variables according to 'what 
variable is contributing to what type of information'. 

(v) . Interpretation. Lastly plot the comparison density estimate for the top few interesting 

variables to demonstrate graphically 'how and why' those variables are important. The 
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Table 2: Top ranked variables under different Category: Prostate Cancer Data 



i?2(Mean) 


-R| (Variance) 


i?3 (Skewness) 






452 


614 


16 


614 


332 


411 


1546 


669 


77 


77 


739 


377 


423 


332 


614 


4552 


1139 


24 


808 


579 



various shapes of comparison density (see Figure 3 (c,d) and 4 (b)) conveys 'why' a 
particular variable is interesting for scientific understanding. 

What's remain is the following question: " How many top variables to select ? ". The next 
Section is dedicated primarily to build a simple yet powerful theoretical set up to attack this 
problem. 

6 Separating Signal from Noise 

Problem Statement 

The problem that we want to tackle now, has a very general setup, which is commonly known 
as detection/threshold selection problem. Figure 2 shows the plot of the sorted values of 
the CR-statistic and the problem is to select a proper threshold. To control the number of 
falsely rejected hypotheses, Benjamini and Hochberg (1995) introduced the false discovery 
rate (FDR) and described a procedure to control FDR. There are many variants of FDR 
that have been introduced in the past decade (mFDR,pFDR,fdr). Efron (2004) introduced 
an alternative measure local false discovery rate (Locfdr/fdr) based on a two-group model to 
control FDR from a density estimation perspective. The local false discovery rate is defined 
as 

fdr(2) = Pr{null | Z = z} = Po|^, (6.1) 

f(z) 

where Pr Null = po , fo is the theoretical null density, /i density of interesting variables and 
f{z) = pofo + (1 — Po)fi, pooled density. Recently, Muralidharan (2010) proposed Mixfdr 
method, based on the hierarchical empirical Bayes mixture model to accurately estimate fdr. 
A few key observations on the fdr literature are: 
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(i) The main focus of current research on fdr concerns flexible estimation of f{x), using either 
exponential model or sphnes or fancy mixture models. 

(ii) At the second stage, the estimator for fdT{z) is constructed (Eq 6.1) taking the ratio 
foi^z)/ f{x), where /o is the theoretical null and assuming po = Pr(Null) = 1, without much 
harm (cf. Efron et al. (2001), Efron (2004) for details). Broadly speaking, the estimation of 
fdr is accomplished in a two steps, first estimating the f{x) and then taking the ratio. 

As we shall argue in the next section that there is at least two fundamental flaws with this 
two-step approach for fdr estimation. First, we look at some simulation example. 

Example 1 {Gaussian Model with Contamination) . Consider Zi ~ ■f^ifJ'i, 1), = 1, 2, . . . 1000 
and we have M nonzero components /x, = 4.52. This example is similar to the one considered in 
Abramovich et al. (2006). We have generated M non-zero significant components once for all 
and then at each simulation iteration we have added additional (1000 — M) noise and the task 
is to recover the important signal for the corrupted version. Figure 5 illustrates the result 
of a simulation study for Gaussian shift model. It compares three competing methods for 
threshold selection (i) Locfdr (Efron, 2004), (ii) Mixfdr (Muralidharan, 2010) and comparison 
density based fdr(CDfdr). It is evident that CDfdr does a better job of finding the true number 
of nonzero components compared to other two methods and adapts quite successfully to the 
underlying sparsity level. A surprising fact is that the two competing methods consistently 
underestimate the true number. A possible explanation of this false negative phenomena and 
a simple solution to it is explained in Section 6.1. 

Example 2 {Non- Gaussian Model with Contamination ). Consider the problem where the 
M nonzero signals were drawn from Uniform[2, 4] (once and for all) and (1000 — M) Gaussian 
white noise were added to it. At each simulation run, we generate random noise and we mix 
it with the signal. This example was used in Muralidharan (2010). Our aim to study how 
successful these three methods are for finding out the true number of significant signals and 
as a next step, investigate few plausible reasons for under-performance. Figure 6 demonstrate 
the out-performance of our method in contrast with other two methods. 
The comparison density based fdr (CDfdr), which we will describe in the following section 
appears to be the most reliable and straightforward technology for the variable selection in 
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Locfdr Miifdr CDfdr 

LocMr Miifdr CDIdr 

Figure 5: Gaussian Shift model : Left panel M=25, Zi was drawn from A/'(4.52, 1) (once for 
all). Where as the Right panel has 50 Zi generated from A/'(4.52, 1) (once for all). The figure 
shows the hoxplot of the number of fii was selected as nonzero over 100 simulation run. 

the examples above. The objective of the next section is to explain how we derive and compute 
the CDfdr. In fact, the nonparametric approach developed here, seems applicable to a wider 
class of detection problems having non-gaussian noise. 

6.1 Rationale & Main Contributions 

The first crucial observation is that estimating directly the ratio of two density is much more 
efficient and straightforward rather than estimating them separately and then taking the 
ratio, (i) One sample Comparison density: As a remedy, we apply idea of comparison density 
introduced in Section 3 in one sample case, as a means of devising a direct procedure to 
estimate the proper ratio of the two density. We prefer to work with the inverse-fdr, f/fo 
as opposed to fo/f. There is another factor that merits our attention, {li) Pre- flattened 
smoothing (Parzen (1979)) Even if we reduce the estimation ratio of two density into a single 
step process, the traditional exponential density estimation typically does not work. This is 
due to the fact that /, the pooled density is fat tailed compare to the null distribution /o 
which makes Support (/o) C Support (f) and introduce extra challenge. This directly affects 
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Locfclr Mixfdr GDfdr 

Figure 6: Non-gaussian Model with Contamination: Three boxplots comparing the performance 
of variable selection efficiency over 100 simulation run . 

the tail portion of ///o- Efficient estimation of the tail probability depends upon reducing 
the dynamic range (sharp peak) at the corners and this calls for some new techniques to 
accurately estimate the tail portion. To get an estimate of the fdr, we are only going to focus 
on the tail estimation, rather than estimating ///o on the full support domain, as the central 
part will not contribute anything to estimate the fdr for significant variables. 
To our knowledge, this connection between tail of Comparison density and fdr estimation is 
new and opens many avenue for further investigations. 

6.2 Comparison Density based fdr : Goodness of fit Problem 

One sample Goodness of fit formulation 

Let Ti,T2, . . . ,Tp be random sample from F. In Section 2.2 we have introduced two sample 
comparison density analysis and now we are going to introduce the one sample version of it. 
The basic task of estimating fdr starts with the hypothesis testing question that F(t) = Foit), 
where Fo is the specific distribution either known theoretical null or the empirically esti- 
mated version of it. The probability integral transformation U = -Fo(-^) transforms the data 
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Z-score for Prostate data 
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Figure 7: CDfdr Algorithm : We transform the raw Z-score vales to Ui = ^Hoi^i)- The 
distribution of Ui,U2, ■ ■ - Up directly estimate the ratio f {^Jj^^(u)) / fo{^]jl{u)) = l/fdr('u) = 
d{u). Directly estimating the density of^Hoi^i) is difficult due the the presence of the sudden 
sharp peak at the both end (which signifies the presence of the signal). So instead we perform 
pre-flattening transformation using Ui = = $ [(x — /to) /<5"o)]- Where the parameters 

are estimated from the pooled data. After this transformation we have a better chance to model 
it efficiently as it reduces the dynamic range at the boundary considerably. 
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Ti, T2, . . . , Tp, to a pseudo-data Ui, U2, ... ,Up on unit interval, which allows us to reformulate 
the problem into a problem of testing uniformity in a unit interval. This is one of the most 
important canonical problem of statistics and is of independent interest. 

Example 

Bottom left panel of Figure 7 shows the histogram of ^{zi),i = 1,2, ... ,p, where Zi is the 
z-score for the ith gene in the Prostate dataset. There is a clear departure of uniformity 
specially in the two corners. 

Direct Approach to Ratio Estimation 

Here we introduce a novel method that automatically estimate the ratio of two density in a 
single step and render some useful modeling and analysis convenience. The key is the following 
simple and yet deep fact about comparison density d{u) for one sample, 

Proposition 1. Density of U is /(Fq"^('u))//o(Fq~^(-u)) = d{u), where u = Fo{t). 

Proof. To see why this is the case first look at the corresponding distribution function 

P(Fo(Z) <u)= P(Z < F,\u)) = FiF,\u)), 

which implies that {F{Fq\u)))' = f{Fo\u))/fo{Fo\u)) = d{u). □ 

This is the main reason why we prefer to work with the inverse-fdr. Now let's look at a simple 
algorithm to estimate the fdr based on the previous observation. 
Simple Algorithm 

I. Transform z Fhq{z) = u. 

II. Estimate density of u. This will give us the ratio of two density as a function of u. 

III. Find {u ; d{u) > 5}, as the conventional threshold for reporting signal/interesting variables 
is fdr(z) < .2. 

Nonparametric Pre-whitening (Parzen, 1979) 

Proposition 1 reduce the problem of fdr estimation into a single density estimation problem. 
But conventional nonparametric density estimation encounter roadblock. Bottom left panel of 
Figure 7 shows a typical shape of the p-values near the boundary. We expect this sudden peak 
in the variable selection scenario as the signal resides on the corner. Traditional nonparametric 
density estimators including exponential density performs poorly to capture the tail. Towards 
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this, we may note that fdr(M) is specially important for m's near the boundary as we expect 
to have significant variables at those regions. To overcome the challenge pose by this large 
dynamic range of the d near the corners we propose a nonparametric pre-fiattened smoothing 
technique . Bottom right panel Figure 7 illustrate the stabilizing/fiattening effect which is 
the histogram of -Fo(^)- Here Fo{z) can be interpreted as a pooled distribution under Hq. For 
prostate data, the mean and standard deviation for the pooled z- values turns out to be and 
1.135 and we choose $(a;/1.135) as Fo{z). We introduce artificially the Fo{z) and write the 
ratio as, 

fix) ^ /o(^) fix) /^^x 
foix) foix)Ux)' 

Note that the we have factorized the original ratio into two parts. In the first part there is 
no approximation error as we exactly know the function /q and /o, so there is no estimation 
exercise. This works as a adjusting weight. Only the second part involves the density ap- 
proximation. But now the good news is that, the simple trick of iterative density estimation 
through pre-flattening enables us to estimate the residual-ratio /(x)//o(a;), much more effi- 
ciently compared to the original version /(a;)//o(a;). In a broad sense our fiattening technique 
could be considered as a regularization method for estimating ratio of two density. 

Remark 3 (Novel Empirical Bayes Connection). Note that our approach is free from any 
assumption of mixture model. It involves no tuning parameter and is a fully data analytic 
approach. As ratio of two density is a integral part of Bayes rule, it is conceivable that our 
novel rank based methodology for estimating the ratio in a fully nonparametric way opens a 
powerful implication and interpretation from empirical Bayes prospective. 

6.3 Back to the Prostate Data 

In this section we will mention few findings for prostate cancer data and compare with other 
competing methods, fdr at the level .2 was used for threshold selection and po was assumed 
to be 1. All of the method used the estimated empirical null as opposed to theoretical null 
model. For our case we have used the central matching estimation technique of Efron (2004). 
Table 3 compares the variable selection performance of Locfdr, Mixfdr and CDfdr on the basis 
of Z-score. Whereas , Table 4 compares the threshold selection procedure where we have use 
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Table 3: Prostate Cancer data: Number of Variables selected using Z-score 



Locfdr 


Mixfdr 


CDfdr 


54 


49 


46 



Table 4: Prostate Cancer data: Number of Variables selected using CR-statistic 



Locfdr 


Mixfdr 


CDfdr 


13 


10 
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only the first two components of our CR-statistic (Eq. 3.5). R packages "locfdr" and "mixfdr" 
was used to implement the Locfdr and Mixfdr methodology. 

Using our similar methodology we can inspect the higher order variable screening and thresh- 
olding performance exactly in the same way. 

7 Discussion &; Future Direction 

Although the main goal of this article is variable selection the implicit aim is to demonstrate 
the effectiveness of quantile based comparison analysis as a means to generalize, unify different 
methodology for high dimensional inference. We made a effort to synthesize powerful applica- 
ble methodology combining several novel ideas of robust nonparametric statistics which can 
adapt for different data types. These development will possibly pave the way for modern 
analysis of high dimensional data using quantile, comparison density, mid distribution score 
function. One of the most attractive feature for our variable selection methodology is that, 
it generates explanation in terms of graphical tool. This graphical element of our method 
helps to translate numbers to shapes and thus act as a important diagnostic tool. In Section 4 
we have studied the novel stochastic process, called two sample comparison density empirical 
process (introduced by Parzen (1983)) and studied the asymptotic properties of our statistic. 
It has shown that, linear rank statistics, the Cramer- von Mises, Anderson-Darling and many 
more can be studied in a unified way and can be conveniently represented as a functional of 
this fundamental process. In Section 6.2 we have neatly rewritten the fdr in terms of compar- 
ison density and established a novel connection with Empirical Bayes method. We have left 
untouched several important aspect including correlated variable selection, which is currently 
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under investigation. However, we believe that some key ideas presented in this article might 
be extended much further than we have managed to do. For example, in trying to describe 
fdr we have introduced the Pre-flattcncd smoothing approach in conjunction with one sample 
comparison density concept. We can use this idea for a general purpose semi-par am ctric den- 
sity estimation technique where we can choose any reasonable parametric distribution to be 
our flattering function. We hope the present article will contribute to a better understanding 
of variable selection from a broader perspective. 
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